Cluster analysis in R: determine the optimal number of clusters
How can I choose the best number of clusters for a k-means analysis? After plotting a subset of the data below, how many clusters would be appropriate? How can I perform a cluster dendrogram analysis?

n = 1000
kk = 10    
x1 = runif(kk)
y1 = runif(kk)
z1 = runif(kk)    
x4 = sample(x1,length(x1))
y4 = sample(y1,length(y1)) 
randObs <- function()
{
  ix = sample( 1:length(x4), 1 )
  iy = sample( 1:length(y4), 1 )
  rx = rnorm( 1, x4[ix], runif(1)/8 )
  ry = rnorm( 1, y4[iy], runif(1)/8 )
  return( c(rx,ry) )
}  
x = c()
y = c()
for ( k in 1:n )
{
  rPair  =  randObs()
  x  =  c( x, rPair[1] )
  y  =  c( y, rPair[2] )
}
z <- rnorm(n)
d <- data.frame( x, y, z )
Alexi answered 13/3, 2013 at 2:39 Comment(2)
If you are not completely wedded to kmeans, you could try the DBSCAN clustering algorithm, available in the fpc package. It's true, you then have to set two parameters... but I've found that fpc::dbscan then does a pretty good job at automatically determining a good number of clusters. Plus it can actually output a single cluster if that's what the data tell you - some of the methods in @Ben's excellent answers won't help you determine whether k=1 is actually best.Olgaolguin
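A minimal sketch of the dbscan suggestion above (assuming the fpc package; the eps and MinPts values are purely illustrative and would normally be tuned, e.g. via a k-nearest-neighbour distance plot):

library(fpc)
db <- dbscan(d, eps = 0.15, MinPts = 5)
table(db$cluster)                            # 0 = noise points, 1..k = cluster labels
length(unique(db$cluster[db$cluster > 0]))   # number of clusters found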
See also stats.stackexchange.com/q/11691/478Eximious
B
1075

If your question is "how can I determine how many clusters are appropriate for a k-means analysis of my data?", then here are some options. The Wikipedia article on determining the number of clusters has a good review of some of these methods.

First, some reproducible data (the data in the Q are... unclear to me):

n = 100
g = 6 
set.seed(g)
d <- data.frame(x = unlist(lapply(1:g, function(i) rnorm(n/g, runif(1)*i^2))), 
                y = unlist(lapply(1:g, function(i) rnorm(n/g, runif(1)*i^2))))
plot(d)

(figure: scatterplot of the simulated data)

One. Look for a bend or elbow in the sum of squared error (SSE) scree plot. See http://www.statmethods.net/advstats/cluster.html & http://www.mattpeeples.net/kmeans.html for more. The location of the elbow in the resulting plot suggests a suitable number of clusters for the kmeans:

mydata <- d
wss <- (nrow(mydata)-1)*sum(apply(mydata,2,var))   # wss[1]: total SS for a single cluster
for (i in 2:15) wss[i] <- sum(kmeans(mydata, centers=i)$withinss)
plot(1:15, wss, type="b", xlab="Number of Clusters",
     ylab="Within groups sum of squares")

We might conclude that 4 clusters would be indicated by this method: (figure: scree plot of within-groups sum of squares)
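As a rough, hedged add-on, one way to pick the elbow programmatically is to take the k with the largest second difference of the wss curve (a discrete stand-in for maximum curvature); it is only a heuristic and should be checked against the plot:

k.elbow <- which.max(diff(diff(wss))) + 1   # +1: the second difference at position j refers to k = j+1
k.elbow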

Two. You can do partitioning around medoids to estimate the number of clusters using the pamk function in the fpc package.

library(fpc)
pamk.best <- pamk(d)
cat("number of clusters estimated by optimum average silhouette width:", pamk.best$nc, "\n")
plot(pam(d, pamk.best$nc))

(figures: PAM clusplot and silhouette plot)

# we could also do:
library(cluster)   # pam() is in the cluster package
asw <- numeric(20)
for (k in 2:20)
  asw[k] <- pam(d, k)$silinfo$avg.width
k.best <- which.max(asw)
cat("silhouette-optimal number of clusters:", k.best, "\n")
# still 4

Three. Calinski criterion: another approach to diagnosing how many clusters suit the data. In this case we try 1 to 10 groups.

require(vegan)
fit <- cascadeKM(scale(d, center = TRUE,  scale = TRUE), 1, 10, iter = 1000)
plot(fit, sortg = TRUE, grpmts.plot = TRUE)
calinski.best <- as.numeric(which.max(fit$results[2,]))
cat("Calinski criterion optimal number of clusters:", calinski.best, "\n")
# 5 clusters!

(figure: cascadeKM plot of group partitions and Calinski criterion values)

Four. Determine the optimal model and number of clusters according to the Bayesian Information Criterion for expectation-maximization, initialized by hierarchical clustering for parameterized Gaussian mixture models.

# See http://www.jstatsoft.org/v18/i06/paper
# http://www.stat.washington.edu/research/reports/2006/tr504.pdf
#
library(mclust)
# Run the function to see how many clusters
# it finds to be optimal; set it to search
# over models with 1 to 20 components.
d_clust <- Mclust(as.matrix(d), G=1:20)
m.best <- dim(d_clust$z)[2]   # equivalently, d_clust$G
cat("model-based optimal number of clusters:", m.best, "\n")
# 4 clusters
plot(d_clust)

(figures: mclust BIC, classification, and density plots)

Five. Affinity propagation (AP) clustering, see http://dx.doi.org/10.1126/science.1136800

library(apcluster)
d.apclus <- apcluster(negDistMat(r=2), d)
cat("affinity propogation optimal number of clusters:", length(d.apclus@clusters), "\n")
# 4
heatmap(d.apclus)
plot(d.apclus, d)

(figures: apcluster heatmap and cluster plot)

Six. Gap Statistic for Estimating the Number of Clusters. See also some code for a nice graphical output. Trying 1 to 10 clusters here:

library(cluster)
clusGap(d, kmeans, 10, B = 100, verbose = interactive())

Clustering k = 1,2,..., K.max (= 10): .. done
Bootstrapping, b = 1,2,..., B (= 100)  [one "." per sample]:
.................................................. 50 
.................................................. 100 
Clustering Gap statistic ["clusGap"].
B=100 simulated reference sets, k = 1..10
 --> Number of clusters (method 'firstSEmax', SE.factor=1): 4
          logW   E.logW        gap     SE.sim
 [1,] 5.991701 5.970454 -0.0212471 0.04388506
 [2,] 5.152666 5.367256  0.2145907 0.04057451
 [3,] 4.557779 5.069601  0.5118225 0.03215540
 [4,] 3.928959 4.880453  0.9514943 0.04630399
 [5,] 3.789319 4.766903  0.9775842 0.04826191
 [6,] 3.747539 4.670100  0.9225607 0.03898850
 [7,] 3.582373 4.590136  1.0077628 0.04892236
 [8,] 3.528791 4.509247  0.9804556 0.04701930
 [9,] 3.442481 4.433200  0.9907197 0.04935647
[10,] 3.445291 4.369232  0.9239414 0.05055486

Here's the output from Edwin Chen's implementation of the gap statistic: (figure: gap statistic plot)
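A hedged sketch for producing a similar plot directly from the cluster package (the firstSEmax rule below is the same one clusGap reports above):

library(cluster)
gap <- clusGap(d, kmeans, K.max = 10, B = 100)
plot(gap, main = "Gap statistic")
k.gap <- maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"], method = "firstSEmax")
abline(v = k.gap, lty = 2)   # dashed line at the chosen number of clusters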

Seven. You may also find it useful to explore your data with clustergrams to visualize cluster assignment, see http://www.r-statistics.com/2010/06/clustergram-visualization-and-diagnostics-for-cluster-analysis-r-code/ for more details.
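The clustergram() function is not in a CRAN package; as a hedged sketch, after sourcing the function definition from the post linked above, a call along these lines produces the diagnostic plot:

# source("clustergram.r")   # the function definition from the linked blog post
clustergram(as.matrix(d), k.range = 2:8, line.width = 0.004)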

Eight. The NbClust package provides 30 indices to determine the number of clusters in a dataset.

library(NbClust)
nb <- NbClust(d, diss=NULL, distance = "euclidean",
        method = "kmeans", min.nc=2, max.nc=15, 
        index = "alllong", alphaBeale = 0.1)
hist(nb$Best.nc[1,], breaks = max(na.omit(nb$Best.nc[1,])))
# Looks like 3 is the most frequently determined number of clusters
# and curiously, four clusters is not in the output at all!

(figure: histogram of the number of clusters chosen by the NbClust indices)

If your question is "how can I produce a dendrogram to visualize the results of my cluster analysis?", then you should start with these:

http://www.statmethods.net/advstats/cluster.html

http://www.r-tutor.com/gpu-computing/clustering/hierarchical-cluster-analysis

http://gastonsanchez.wordpress.com/2012/10/03/7-ways-to-plot-dendrograms-in-r/ And see here for more exotic methods: http://cran.r-project.org/web/views/Cluster.html

Here are a few examples:

d_dist <- dist(as.matrix(d))   # compute the distance matrix
plot(hclust(d_dist))           # apply hierarchical clustering and plot

(figure: hclust dendrogram)
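A small follow-up sketch: rect.hclust() from base R draws boxes around a chosen number of groups on the dendrogram, which helps when eyeballing where to cut (k = 4 here is just an example):

hc <- hclust(d_dist)
plot(hc)
rect.hclust(hc, k = 4, border = "red")   # boxes around 4 groups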

# a Bayesian clustering method, good for high-dimension data, more details:
# http://vahid.probstat.ca/paper/2012-bclust.pdf
install.packages("bclust")
library(bclust)
x <- as.matrix(d)
d.bclus <- bclust(x, transformed.par = c(0, -50, log(16), 0, 0, 0))
viplot(imp(d.bclus)$var); plot(d.bclus); ditplot(d.bclus)
dptplot(d.bclus, scale = 20, horizbar.plot = TRUE, varimp = imp(d.bclus)$var, horizbar.distance = 0, dendrogram.lwd = 2)
# I just include the dendrogram here

(figure: bclust dendrogram)

Also for high-dimensional data there is the pvclust library, which calculates p-values for hierarchical clustering via multiscale bootstrap resampling. Here's the example from the documentation (it won't work on such low-dimensional data as in my example):

library(pvclust)
library(MASS)
data(Boston)
boston.pv <- pvclust(Boston)
plot(boston.pv)

(figure: pvclust dendrogram with AU/BP p-values)

Bluepencil answered 13/3, 2013 at 3:20 Comment(19)
For the last dendrogram (Cluster Dendrogram with AU/BP) sometimes it's convenient to draw rectangles around the groups with relatively high p-values: pvrect(fit, alpha=0.95)Suu
This is exactly what I was looking for. I am new to R and it would have taken me very long to find this. Thanks @Bluepencil for answering in such detail. Can you please guide me as to where can I find the logic behind each one of these methods, like what metric or criterion they are using to determine the optimal number of clusters, or how is each one of them different from each other. My boss wants me to tell that, so we can decide which one of the methods to use. Thanks in advance.Wellmannered
Yes, you can inspect the source code for all of these functions to find out the logic, metrics, and difference from each other. Beyond that you might find a few stats textbooks useful.Bluepencil
Why did you write the following code in answer One? wss <- (nrow(mydata)-1)*sum(apply(mydata,2,var)) Am I wrong, or does wss get overwritten in the loop? ThanksSnowslide
For clustering there is now also the freqparcoord library.Parbuckle
Great answer! Visualization methods are nice, however, I'd be interested in a purely analytical approach to select the optimal number of clusters as an input for further analysis of Gaussian mixture models. I've tried using MclustDA(), but it produces the error "cannot allocate vector of size 1.8 Gb", which is strange since my training data is a numeric vector of only 26970 elements. Why? I've discovered an interesting and promising approach, described here (smtp.intjit.org/journal/volume/12/7/127_2.pdf), but failed to find its implementation, preferably in R. What do you think about it?Masakomasan
The only analytical method from your answer that I could use is the one, using pam() function from the fpc package. I just tried it with 50K numeric vector, but it produced an error "cannot allocate vector of size 10.8 Gb". Could you clarify this? I'm still interested in your opinion on the approach referenced in my previous comment. Thank you!Masakomasan
@superagio that line gives the value for wss[1]; you won't be able to see the 'elbow' clearly without it.Bluepencil
@AleksandrBlekh thanks for your kind note. Probably those functions involve a lot of copying of your data, which is why you run out of memory. You could rewrite them to minimize copying, or use simpler functions that do less copying, or sample your data down until it can fit within your RAM. I'm not familiar with that paper, but it sounds like it has some similarities to the Mclust function that you already tried.Bluepencil
My pleasure! Thank you for your feedback as well. Despite this being an interesting topic, I'm now questioning the feasibility of my efforts on using Gaussian mixture analysis to determine components, considering that I'm doing factor analysis for my research anyway.Masakomasan
@Bluepencil I have a question that might be more conceptual, but I was not able to find an answer, so I must bump into this post, hoping that you could help me with it, since I was amazed by your answer. I do not work with R, but I do have a problem determining the number of clusters. If I try the elbow method, I have to rerun the algorithm while increasing the number of clusters. How can I know how much the random initialization influences the decrease of the objective function? Maybe the decrease would be greater because of a better initialization? It would be great if you could answer. ThanksWestering
@eugen that question is a bit much for this small comment space. Try asking it as a question on CrossValidated.Bluepencil
@Aleksandr Blekh You could also try to turn any graphical method to the analytical. E.g., I use "elbow" method (first mentioned in the answer), but try to find it analytically. The elbow point could be the point with maximum curvature. For discrete data, it is the point with maximum second order central difference (analog to max. second order derivative for continuous data). See https://mcmap.net/q/81364/-compute-the-39-elbow-39-for-a-curve-automatically-and-mathematically and https://mcmap.net/q/81363/-finding-the-best-trade-off-point-on-a-curve/1075993. I guess that other graphical methods could be converted to analytical as well.Hasheem
@AndreySapegin: I could, but: 1) frankly, I don't consider it an elegant solution (IMHO, in most cases, visual methods should remain visual, while analytical ones should remain analytical); 2) I've figured out analytical solution to this, using one or several R packages (it's on my GitHub - you're welcome to take a look); 3) my solution seems to work well enough, plus, it's been a while and I've already finalized my dissertation software, dissertation report (thesis) and currently I'm preparing for the defense :-). Regardless, I much appreciate your comment and links. All the best!Masakomasan
Which method takes into consideration outliers?Minton
Very nice answer, but it all depends on ready-made software; I don't see any algorithm explanation here :(Horrid
My current clustering dataset has 2.2 million rows. None of these R packages work on it, I expect; in my experience they just overwhelm my computer and then it falls over. However, it looks like the author knows his stuff for small data and for the general case without regard to software capacity. No points deducted due to obvious good work by the author. Y'all please just know plain old R is horrible at 2.2 million rows -- try it yourself if you don't trust me. H2O helps but is limited to a small walled garden of happy.Titration
I tried clusGap(mydata, kmeans, 10, B = 100, verbose = interactive()) where mydata is just 200000 rows of data in two columns. It failed with a message like Error: cannot allocate vector of size 129.1 Gb In addition: Warning messages: 1: In dist(xs) : Reached total allocation of 32586Mb: see help(memory.size), though I have 32 GB RAM with an 8-core CPU!!Kinescope
Great answer, in the NbClust example, shouldn't it be NULL, without quotes?Mineralogist
C
24

It's hard to add anything to such an elaborate answer. Though I feel we should mention identify here, particularly because @Ben shows a lot of dendrogram examples.

d_dist <- dist(as.matrix(d))   # find distance matrix 
plot(hclust(d_dist)) 
clusters <- identify(hclust(d_dist))

identify lets you interactively choose clusters from a dendrogram and stores your choices in a list. Hit Esc to leave interactive mode and return to the R console. Note that the list contains the indices, not the rownames (as opposed to cutree).
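For example (a small sketch), after leaving interactive mode:

str(clusters)          # a list of integer index vectors, one per cluster you selected
d[clusters[[1]], ]     # the rows of d in the first cluster you selected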

Cracked answered 19/4, 2016 at 21:6 Comment(0)
M
14

To determine the optimal number of clusters k, I usually use the elbow method combined with parallel processing to avoid long run times. Sample code:

Elbow method

# requires the GMD package for css.hclust() and elbow.batch()
library(GMD)
elbow.k <- function(mydata){
  dist.obj <- dist(mydata)
  hclust.obj <- hclust(dist.obj)
  css.obj <- css.hclust(dist.obj, hclust.obj)
  elbow.obj <- elbow.batch(css.obj)
  k <- elbow.obj$k
  return(k)
}

Running Elbow parallel

library(parallel)
no_cores <- detectCores()
cl <- makeCluster(no_cores)
clusterEvalQ(cl, library(GMD))
# export your own objects/functions to the workers
clusterExport(cl, list("data.clustering", "data.convert", "elbow.k", "clustering.kmeans"))
start.time <- Sys.time()
k.clusters <- parSapply(cl, 1, function(x) elbow.k(data.clustering))
end.time <- Sys.time()
stopCluster(cl)
cat('Time to find k using the elbow method:', (end.time - start.time),
    'seconds, with k value:', k.clusters)

It works well.

Metabolism answered 9/8, 2016 at 8:38 Comment(2)
The elbow and css functions come from the GMD package: cran.r-project.org/web/packages/GMD/GMD.pdfLoats
GMD is no longer available for latest versions of R, is there a replacement?Takeo
W
11

A simple solution is the factoextra library. You can change the clustering method and the method for calculating the best number of groups. For example, if you want to know the best number of clusters for k-means:

Data: mtcars

library(factoextra)   
fviz_nbclust(mtcars, kmeans, method = "wss") +
      geom_vline(xintercept = 3, linetype = 2)+
      labs(subtitle = "Elbow method")

Finally, we get a graph like:

(figure: elbow plot produced by fviz_nbclust)
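As a hedged follow-up, the same function also supports the average-silhouette and gap-statistic criteria, so the elbow result can be cross-checked:

fviz_nbclust(mtcars, kmeans, method = "silhouette") +
  labs(subtitle = "Silhouette method")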

Woolsack answered 12/12, 2018 at 21:47 Comment(0)
T
9

Splendid answer from Ben. However, I'm surprised that the Affinity Propagation (AP) method has been suggested here just to find the number of clusters for the k-means method, when in general AP does a better job of clustering the data. Please see the scientific paper supporting this method in Science here:

Frey, Brendan J., and Delbert Dueck. "Clustering by passing messages between data points." science 315.5814 (2007): 972-976.

So if you are not biased toward k-means, I suggest using AP directly; it will cluster the data without requiring the number of clusters to be known in advance:

library(apcluster)
apclus = apcluster(negDistMat(r=2), data)
show(apclus)

If negative Euclidean distances are not appropriate, you can use other similarity measures provided in the same package. For example, for similarities based on Spearman correlations, this is what you need:

sim = corSimMat(data, method="spearman")
apclus = apcluster(s=sim)

Please note that those similarity functions in the apcluster package are just provided for convenience. In fact, the apcluster() function will accept any matrix of correlations. The same as before with corSimMat() can be done with this:

sim = cor(data, method="spearman")

or

sim = cor(t(data), method="spearman")

depending on whether you want to cluster the rows or the columns of your matrix.
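Either way, a brief hedged sketch for inspecting the result once apcluster() has run:

length(apclus@clusters)   # number of clusters AP settled on
apclus@exemplars          # the exemplar (prototype) observation of each cluster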

Triiodomethane answered 12/4, 2017 at 20:8 Comment(0)
T
8

These methods are great but when trying to find k for much larger data sets, these can be crazy slow in R.

A good solution I have found is the "RWeka" package, which has an efficient implementation of the X-Means algorithm - an extended version of K-Means that scales better and will determine the optimum number of clusters for you.

First you'll want to make sure that Weka is installed on your system and that XMeans is installed through Weka's package manager tool.

library(RWeka)

# Print a list of available options for the X-Means algorithm
WOW("XMeans")

# Create a Weka_control object which will specify our parameters
weka_ctrl <- Weka_control(
    I = 1000,                          # max no. of overall iterations
    M = 1000,                          # max no. of iterations in the kMeans loop
    L = 20,                            # min no. of clusters
    H = 150,                           # max no. of clusters
    D = "weka.core.EuclideanDistance", # distance metric Euclidean
    C = 0.4,                           # cutoff factor ???
    S = 12                             # random number seed (for reproducibility)
)

# Run the algorithm on your data, d
x_means <- XMeans(d, control = weka_ctrl)

# Assign cluster IDs to original data set
d$xmeans.cluster <- x_means$class_ids
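A quick hedged follow-up to check how many clusters X-Means actually settled on:

length(unique(x_means$class_ids))   # number of distinct cluster IDs
table(x_means$class_ids)            # cluster sizes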
Tremolite answered 27/12, 2017 at 15:7 Comment(0)
W
4

The answers are great. If you want to give another clustering method a chance, you can use hierarchical clustering and see how the data split.

> set.seed(2)
> x=matrix(rnorm(50*2), ncol=2)
> hc.complete = hclust(dist(x), method="complete")
> plot(hc.complete)

(figure: complete-linkage dendrogram)

Depending on how many classes you need, you can cut your dendrogram as:

> cutree(hc.complete,k = 2)
 [1] 1 1 1 2 1 1 1 1 1 1 1 1 1 2 1 2 1 1 1 1 1 2 1 1 1
[26] 2 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 2 1 1 1 1 1 1 1 2

If you type ?cutree you will see the definitions. If your data set has three classes it will be simply cutree(hc.complete, k = 3). The equivalent for cutree(hc.complete,k = 2) is cutree(hc.complete,h = 4.9).

Weatherboarding answered 22/8, 2017 at 23:56 Comment(1)
I prefer Ward's over complete.Cobia
N
0

It can be confusing to browse through so many functions without considering the performance factor. Bear in mind that some of the functions in the available packages do much more than just find the optimal number of clusters. Here are benchmark results of these functions for anyone considering them for a project:

n = 100
g = 6 
set.seed(g)
d <- data.frame(x = unlist(lapply(1:g, function(i) rnorm(n/g, runif(1)*i^2))), 
                y = unlist(lapply(1:g, function(i) rnorm(n/g, runif(1)*i^2))))


mydata <- d
require(cluster)
require(vegan)
require(mclust)
require(apcluster)
require(NbClust)
require(fpc)

microbenchmark::microbenchmark(
  wss = {
    wss <- (nrow(mydata)-1)*sum(apply(mydata,2,var))
    for (i in 2:15) wss[i] <- sum(kmeans(mydata, centers=i)$withinss)
  },
  
  fpc = {
    asw <- numeric(20)
    for (k in 2:20)
      asw[[k]] <- pam(d, k) $ silinfo $ avg.width
    k.best <- which.max(asw)
  },
  fpc_1 = fpc::pamk(d),
  
  vegan = {
    fit <- cascadeKM(scale(d, center = TRUE,  scale = TRUE), 1, 10, iter = 1000)
    plot(fit, sortg = TRUE, grpmts.plot = TRUE)
    calinski.best <- as.numeric(which.max(fit$results[2,]))
  },
  
  mclust = {
    d_clust <- Mclust(as.matrix(d), G=1:20)
    m.best <- dim(d_clust$z)[2]
  },
  d.apclus = apcluster(negDistMat(r=2), d),
  clusGap = clusGap(d, kmeans, 10, B = 100, verbose = interactive()),
  NbClust = NbClust(d, diss=NULL, distance = "euclidean",
                method = "kmeans", min.nc=2, max.nc=15, 
                index = "alllong", alphaBeale = 0.1),
  
  
  times = 1)
Unit: milliseconds
     expr         min          lq        mean      median          uq         max neval
      wss    16.83938    16.83938    16.83938    16.83938    16.83938    16.83938     1
      fpc   221.99490   221.99490   221.99490   221.99490   221.99490   221.99490     1
    fpc_1    43.10493    43.10493    43.10493    43.10493    43.10493    43.10493     1
    vegan  1096.08568  1096.08568  1096.08568  1096.08568  1096.08568  1096.08568     1
   mclust  1531.69475  1531.69475  1531.69475  1531.69475  1531.69475  1531.69475     1
 d.apclus    28.56100    28.56100    28.56100    28.56100    28.56100    28.56100     1
  clusGap  1096.50680  1096.50680  1096.50680  1096.50680  1096.50680  1096.50680     1
  NbClust 10940.98807 10940.98807 10940.98807 10940.98807 10940.98807 10940.98807     1

I found the function pamk in fpc package to be most useful for my requirements.

Nnw answered 22/4, 2022 at 11:55 Comment(0)
